Data Analytics project on FIFA World Cup

I took the World Cup data from kaggle.com and uploaded it to github.com; the notebook pulls the data from GitHub and reads it in Jupyter. I first tried to understand the potential parameters in the dataset along which the data can be interpreted in different dimensions. After deciding on the parameters, I cleaned the data and kept only the columns useful for the analysis. I used tools and modules such as NumPy, Pandas, Matplotlib, Seaborn, os, urllib, and Plotly, along with many built-in features.

Downloading the Dataset

I have taken World Cup data from kaggle.com covering 1986 to 2022 (10 tournaments).

Downloading and loading the dataset

Let's begin by downloading the data, and listing the files within the dataset.
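The loading step can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual cell: the URL in DATA_URL is a placeholder, the helper name load_matches is hypothetical, and the "Year" column name is assumed from the columns mentioned later in this notebook. A small in-memory CSV stands in for the real file so the example runs without network access.

```python
import io

import pandas as pd

# Placeholder only: replace with the raw GitHub URL of your own copy of the CSV.
DATA_URL = "https://raw.githubusercontent.com/USER/REPO/main/matches.csv"


def load_matches(source, first_year=1986, last_year=2022):
    """Read the matches CSV (path, URL, or file-like object) and keep
    only the tournaments between first_year and last_year inclusive."""
    df = pd.read_csv(source)
    in_range = df["Year"].between(first_year, last_year)
    return df.loc[in_range].reset_index(drop=True)


# Tiny in-memory sample so the sketch runs offline.
sample_csv = io.StringIO(
    "Year,Host,home_team,away_team,home_score,away_score\n"
    "1982,Spain,Italy,West Germany,3,1\n"
    "1986,Mexico,Argentina,West Germany,3,2\n"
    "2022,Qatar,Argentina,France,3,3\n"
)
sample_df = load_matches(sample_csv)
print(sample_df)
```

With the real file, the call would be load_matches(DATA_URL), since pandas can read a CSV directly from a URL.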

Data Cleaning

Exploratory Analysis and Visualization

Data analysis of the FIFA World Cup, taking numerous factors into consideration. The data analyzed covers the last 10 World Cups, i.e. from 1986 to 2022.

Data Manipulation and Visualization

Grouping the data by year to get aggregated information for each tournament year.
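A year-wise grouping of this kind might look like the sketch below. The column name "Attendance" and the output names (matches_played, total_attendance, attendance_per_match) are assumptions for illustration, and the few rows are a toy sample rather than the full dataset.

```python
import pandas as pd

# Toy sample with assumed column names; the real dataframe has many more rows.
matches = pd.DataFrame({
    "Year": [2018, 2018, 2022, 2022, 2022],
    "Host": ["Russia", "Russia", "Qatar", "Qatar", "Qatar"],
    "Attendance": [78011, 27015, 67372, 45334, 88966],
    "home_score": [5, 0, 0, 6, 3],
    "away_score": [0, 1, 2, 2, 3],
})

# One row per tournament: match count, total attendance, and goal totals.
yearly = matches.groupby(["Year", "Host"], as_index=False).agg(
    matches_played=("Attendance", "size"),
    total_attendance=("Attendance", "sum"),
    home_goals=("home_score", "sum"),
    away_goals=("away_score", "sum"),
)

# Dividing by the number of matches makes years with different
# tournament sizes comparable.
yearly["attendance_per_match"] = (
    yearly["total_attendance"] / yearly["matches_played"]
)
print(yearly)
```

Keeping "Host" in the grouping key means the same table can be plotted either by year or by host.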

As we can see from the dataset, home_xg (home expected goals) and away_xg (away expected goals) are available only for 2018 and 2022, because advances in technology made such estimates possible.

The trend goes up and down from year to year, but total attendance is increasing overall, with 1994 as the exception.

After computing attendance per match, it is clear that it was highest in 1994. The attendance-per-match graph also shows a slight increase over the years, except for 1994, 1998 and 2002.

The trend goes up and down from host to host, but total attendance is increasing overall, except when the USA was host.

Attendance per match was highest when the USA was host. The attendance-per-match graph also shows a slight increase over the years, the exceptions being when the USA, France, and Korea, Japan (joint hosts) were the hosts.

The maximum number of home goals was scored in 2022, when Qatar was host, and the maximum number of away goals was scored in 2014, when Brazil was host. There is no clear trend in home goals or away goals.

It is clear from the above graph that home goals exceed away goals in every year except 2014.

It is clear from the above graph that home goals exceed away goals for every host nation except Brazil.

Neither home penalties nor away penalties dominate, and the distribution is uneven, so no firm conclusion can be drawn about home and away penalties. Penalties can go either way.

Creating separate dataframes of wins, losses, and draws from the data.
Notes:
Home wins and away loses when the home score is greater than the away score, i.e. home score > away score.
Home loses and away wins when the home score is less than the away score, i.e. home score < away score.
A draw happens when the home score is equal to the away score, i.e. home score = away score.
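The three rules above translate directly into boolean masks. The sketch below assumes the column names home_team, away_team, home_score, and away_score and uses a toy sample; the dataframe names are illustrative.

```python
import pandas as pd

# Toy sample; scores here are goals before any penalty shootout.
matches = pd.DataFrame({
    "home_team": ["Brazil", "Spain", "Argentina", "Germany"],
    "away_team": ["Croatia", "Germany", "France", "Brazil"],
    "home_score": [3, 0, 3, 1],
    "away_score": [1, 1, 3, 1],
})

# Split the matches by comparing the two score columns.
home_wins = matches[matches["home_score"] > matches["away_score"]]
away_wins = matches[matches["home_score"] < matches["away_score"]]
draws = matches[matches["home_score"] == matches["away_score"]]

# Count wins per team on each side.
home_win_counts = home_wins["home_team"].value_counts()
away_win_counts = away_wins["away_team"].value_counts()

print(home_win_counts)
print(away_win_counts)
print(len(draws), "draws")
```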

The team with the most wins as the home team is Brazil with 29 wins, followed by Germany and Argentina with 26 wins each.

The team with the most wins as the away team is Spain with 14 wins, followed by Germany and France with 11 wins each.

As we can see from the graph, there are a lot more home wins than away wins.
Brazil has 29 home wins and 10 away wins, total = 39 wins.
Germany has 26 home wins and 11 away wins, total = 37 wins.
Argentina has 26 home wins and 5 away wins, total = 31 wins.
France has 17 home wins and 11 away wins, total = 28 wins.
Note: These totals exclude wins on penalties.

Note: When a draw occurs in the knockout rounds, i.e. home score = away score, the match goes to overtime, and if the score is still level after overtime, the match goes to penalty kicks.
From the output of the above cell we can see that the margin column has only 34 non-null values, and looking at the above dataframe we can see that margin is NaN whenever the round column has the value group stage.
In the above data, "home_penalty" and "away_penalty" have 0 non-null values, i.e. all values are null (NaN). We can conclude that there are no penalty shootouts in the group stage of the World Cup, which is true in real life and speaks to the credibility of the data.
Now finding out which matches were home penalty wins and away penalty wins using margin.
Note: When margin > 0 the home team wins; when margin < 0 the away team wins.

Argentina has the best penalty shootout record, with 2 shootout wins as the home team and 4 as the away team (the highest for any away team), a total of 6.

Croatia also has a good record, with 2 shootout wins as the home team and 2 as the away team, a total of 4.

Brazil and Germany have won shootouts only as the home team, with 3 shootout wins each.

Taking the original dataset, i.e. data_df, for manipulation, because the new dataset matches_df contains only filtered columns.
Note: When margin > 0 the home team wins; when margin < 0 the away team wins.
The above manipulation only gives the winning team when a team wins in normal time or in overtime. To cover teams winning on penalty kicks, the penalty margin is considered as another parameter. So:
if home score > away score, Winner = home team
if home score < away score, Winner = away team
if home penalty score > away penalty score, Winner = home team
if home penalty score < away penalty score, Winner = away team
else Winner = Draw/no winner, which happens when home score == away score
Here home penalty score == away penalty score is not possible, because the whole purpose of penalty kicks is to produce a winner.
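One way to express these rules is with numpy.select, which picks the first matching condition per row. This is a sketch under assumptions: it compares the score and penalty columns directly rather than using the notebook's margin columns, and the toy rows are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy sample: one shootout win, one normal win, one group-stage draw,
# and one more shootout win.
matches = pd.DataFrame({
    "home_team": ["Argentina", "Brazil", "Mexico", "Croatia"],
    "away_team": ["France", "Croatia", "Poland", "Japan"],
    "home_score": [3, 3, 0, 1],
    "away_score": [3, 1, 0, 1],
    "home_penalty": [4, np.nan, np.nan, 3],
    "away_penalty": [2, np.nan, np.nan, 1],
})

# Conditions are checked in order. Comparisons against NaN are False,
# so matches without a shootout fall through to the default.
conditions = [
    matches["home_score"] > matches["away_score"],
    matches["home_score"] < matches["away_score"],
    matches["home_penalty"] > matches["away_penalty"],
    matches["home_penalty"] < matches["away_penalty"],
]
choices = [
    matches["home_team"],
    matches["away_team"],
    matches["home_team"],
    matches["away_team"],
]
matches["Winner"] = np.select(conditions, choices, default="Draw")
print(matches[["home_team", "away_team", "Winner"]])
```

Because the score conditions come first, the penalty columns are only consulted when the score is level, which mirrors the order of the rules above.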

The most dominant team during the World Cups from 1986 to 2022 was Brazil with 42 wins, followed by Germany with 40 wins and Argentina in third with 37 wins.

From the above graph it is clear that attendance per round is highest in the final, followed by the semi-finals and then the quarter-finals, which makes sense given the high level of competition in those stages of the World Cup.

From the above graph, goals per match is highest in the third-place match, which is quite interesting and contrary to the popular belief that the final would have the most goals per match.

The round of 16 and the group stage coming above the semi-finals and quarter-finals in goals per match is also an intriguing observation.

The reason for fewer goals per match in the semi-finals and quarter-finals may be that more of those matches are decided on penalties.

There are no penalty shootouts in the group stage, which is why the above graph shows zero penalty goals per match for the group stage. The absence of penalty goals in the third-place match is more striking: the chance of a third-place match going to penalties is very low.

In the quarter-finals, penalty goals per match is highest at 2.33, which is quite high and contrary to popular belief.

The picture becomes clear in the total goals per match graph: the final has the most goals per match, followed by the quarter-finals.
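The per-round figures discussed here can be computed along the following lines. The column name "Round" and the derived names normal_goals, penalty_goals, and total_goals are assumptions, and the rows are a toy sample, so the numbers it prints are not the ones quoted above.

```python
import numpy as np
import pandas as pd

# Toy sample; penalty columns are NaN when there was no shootout.
matches = pd.DataFrame({
    "Round": ["Group stage", "Group stage", "Quarter-finals", "Final"],
    "home_score": [2, 0, 1, 3],
    "away_score": [1, 0, 1, 3],
    "home_penalty": [np.nan, np.nan, 4, 4],
    "away_penalty": [np.nan, np.nan, 2, 2],
})

matches["normal_goals"] = matches["home_score"] + matches["away_score"]
# Row-wise sum skips NaN, so matches without a shootout get 0 penalty goals.
matches["penalty_goals"] = matches[["home_penalty", "away_penalty"]].sum(axis=1)
matches["total_goals"] = matches["normal_goals"] + matches["penalty_goals"]

# Mean per match within each round.
per_round = matches.groupby("Round")[
    ["normal_goals", "penalty_goals", "total_goals"]
].mean()
print(per_round)
```

Keeping normal and penalty goals in separate columns makes it easy to see how much of a round's total comes from shootouts.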

Creating a new dataset containing only the World Cup finals, as the final is the most important match and carries the most value.

According to the graph, the highest attendance in a final was when Mexico was host, even though we previously saw that total attendance and attendance per match were highest when the USA was host.

The lowest attendance in a final was recorded when Korea, Japan (joint hosts) and Germany were the hosts.

According to the graph, the highest attendance in a final was in 1986, even though we previously saw that total attendance and attendance per match were highest in 1994.

The lowest attendance in a final was recorded in 2002 and 2006.

Just as the Winner column was created in the previous example based on margin and penalty margin, the same is done to get the Winning Manager and Winning Captain.

Luiz Felipe Scolari is the most successful manager with 16 wins managing Brazil, followed by Didier Deschamps with 14 wins managing France, and in third place Joachim Löw with 12 wins managing Germany.

Carlos Alberto Parreira managed two different teams across his World Cup campaigns, South Africa and Brazil, and still collected 11 wins.
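A summary of wins per manager together with the teams they managed could be built as sketched below. The column names Winning_manager and Winner are assumed from the columns described in this notebook, and the rows are a toy sample, not the real win counts.

```python
import pandas as pd

# Toy sample: one row per match won, with the winning side's manager.
wins = pd.DataFrame({
    "Winning_manager": [
        "Carlos Alberto Parreira",
        "Carlos Alberto Parreira",
        "Carlos Alberto Parreira",
        "Didier Deschamps",
        "Didier Deschamps",
    ],
    "Winner": ["Brazil", "Brazil", "South Africa", "France", "France"],
})

# For each manager: number of wins and the distinct teams those wins came with.
manager_summary = (
    wins.groupby("Winning_manager")["Winner"]
    .agg(wins="count", teams=lambda s: ", ".join(sorted(s.unique())))
    .sort_values("wins", ascending=False)
)
print(manager_summary)
```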

From the above data we can see that a lot of data is missing for Winning_captain. Let us take a deeper look. Winning_captain is derived from home_captain and away_captain, so let us look into those as well.
As you can see from the above table, the values for home_captain and away_captain are missing from 1998 to 2014. Something might have gone wrong during data manipulation, so let us take a look at the original dataframe.
As you can see from the above table, the values for home_captain and away_captain are missing in the original dataset as well, so I created a new dataset, filling in the empty values myself.

As the above table shows, Winning_captain now has 604 values.

The player with the most wins as captain is Hugo Lloris, with France picking up 14 wins under his leadership.

The player with the second most wins as captain is Lionel Messi, with Argentina picking up 13 wins under his leadership.

The player with the third most wins as captain is Diego Maradona, with Argentina picking up 12 wins under his leadership.

The teams with the most World Cup titles are Argentina, France, and Brazil, each with 2 trophies in the 10 World Cups from 1986 to 2022.

Note: The Score column contains the normal score as well as the penalty score. For example, the 1st row in the dataset, with "Year" as "2022" and "Host" as "Qatar", has the Score (4)3-3(2), where "Argentina" is the "home_team" and "France" is the "away_team". The score written without brackets is the normal score, i.e. Argentina 3 - 3 France, as can be seen in the home_score and away_score columns. Because the score was 3-3, the match went to a penalty shootout. The scores in brackets are the penalties scored in the shootout, i.e. Argentina (4) - (2) France, so Argentina won the World Cup in the shootout, scoring 4 penalties to France's 2. The same can be seen in the "Notes" column in the above dataframe.
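To make the Score format concrete, here is a small parser sketch using a regular expression. The function name parse_score and the returned keys are illustrative choices; the dataset already provides these values in separate columns, so this is only meant to show how the string breaks down.

```python
import re

# Optional "(n)" before and after the normal score; hyphen or en dash between goals.
SCORE_PATTERN = re.compile(
    r"^(?:\((\d+)\))?\s*(\d+)\s*[-\u2013]\s*(\d+)\s*(?:\((\d+)\))?$"
)


def parse_score(score):
    """Split a score such as '(4)3-3(2)' into normal and shootout goals.

    Shootout values are None when the match had no penalty shootout.
    """
    match = SCORE_PATTERN.match(score.strip())
    if match is None:
        raise ValueError(f"Unrecognized score format: {score!r}")
    home_pen, home, away, away_pen = match.groups()
    return {
        "home_score": int(home),
        "away_score": int(away),
        "home_penalty": int(home_pen) if home_pen is not None else None,
        "away_penalty": int(away_pen) if away_pen is not None else None,
    }


final_2022 = parse_score("(4)3-3(2)")
plain = parse_score("2-1")
print(final_2022)
print(plain)
```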

The sudden spikes in goals when the USA, Germany, and Qatar were hosts are explained by those finals going to penalties, as can be seen in the dataframe given above.

The trend of goals in the final is rather erratic, but we can say it is increasing if we accept that the sudden spikes are due to penalties.

The sudden spikes in goals in 1994, 2006, and 2022 are explained by those finals going to penalties, as can be seen in the dataframe given below.

Importing a new dataset which contains the top scorer of each World Cup and the goals they scored.

The teams with the most appearances in finals are Argentina, France, and Brazil. Each has reached the World Cup final 4 times, as can be seen in the above graph.

The dataset of World Cup winners given below is the most important derivative of the above dataset.

Asking and Answering Questions

Section 1: Analysis of participants of the World Cup from 1986 to 2022

Q.1: How many distinct countries have participated in the World Cup from 1986 to 2022?
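Since every team appears in either the home_team or the away_team column, one way to answer this is to stack both columns and count the unique names. The sketch below uses a toy sample, so the count it prints is not the real answer.

```python
import pandas as pd

# Toy sample of fixtures.
matches = pd.DataFrame({
    "home_team": ["Argentina", "Brazil", "France", "Argentina"],
    "away_team": ["France", "Croatia", "Morocco", "Croatia"],
})

# Stack both columns so each team is counted whichever side it played on.
all_teams = pd.concat(
    [matches["home_team"], matches["away_team"]], ignore_index=True
)

n_distinct = all_teams.nunique()       # number of distinct countries
appearances = all_teams.value_counts() # matches played per team

print(n_distinct)
print(appearances)
```

The same stacked series also gives matches played per team, which is the basis for ranking teams by appearances.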

Q.2: Which continents dominate the World Cup?

Teams with the most appearances in the World Cup
It is clear from the graph that the World Cup is dominated by European teams and some South American teams.

Bird's-eye view of distinct countries and their appearances in the World Cup on a world map

It is clear from the map that the World Cup is dominated not only by European teams but also by South American teams.

Let's look at the Asian teams participating in the World Cup

Q.3: Which Asian teams have participated in the World Cup?

There is a lot of participation from Asia as well. The dominant teams here are South Korea (Korea Republic), Japan, and Saudi Arabia.

Q.4: Which African teams have participated in the World Cup?

The dominant countries/teams here were Cameroon, Nigeria, and Morocco.

Q.1: What is the trend of attendance in the World Cup from 1986 to 2022?

Trend of Total Attendance in World Cup based on Host

The trend goes up and down over the years, but total attendance is increasing overall, except in 1994 when the USA was host.

Trend of Total Attendance in World Cup based on Years

Despite fewer matches being played in 1994, when the USA was host, it recorded the highest attendance of any World Cup from 1986 to 2022.

This isn't a fair comparison of trends, as fewer matches were played in the early years.

Q.2: What is the trend of attendance per match?

The number of matches played differs between years because a different number of teams took part in the competition. So we look at average attendance in order to understand the trends.

After computing attendance per match, it is clear that attendance was highest in 1994, when the USA was host. The attendance-per-match graph also shows a slight increase over the years, except in 1994 when the USA was host.

Q.3: What does the trend of attendance per match look like across the different stages of the World Cup?

Now, let's look at the trend of attendance in the World Cup by round

From the above two graphs it is clear that attendance per round is highest in the final, followed by the semi-finals and then the quarter-finals, which makes a lot of sense, but attendance in the third-place match is slightly low considering the importance of the match.

Q.1: Is there any significant goal-scoring advantage when a team plays as the home team or the away team?

It is clear from the above two graphs that home goals exceed away goals, meaning a team's chance of winning is slightly higher as the home team than as the away team.

Only in 2014, when Brazil was the host nation, did away team goals exceed home team goals, which is quite intriguing.

Q.2: In which year were total goals scored by home teams and away teams at their maximum, and is there any trend?

There is no overall trend in home goals, but we can see that from 1990 the number of home goals increased, peaked in 1998, decreased until 2010, and then started increasing again. The same can be said about away goals.

Total home goals were highest in 2022, when Qatar was host, whereas total away goals were highest in 2014, when Brazil was host.

Q.3: Is there any significant advantage for the home team or the away team in penalty goals?

No trend can be identified from the graph of home penalties and away penalties. The distribution is all over the place and no reliable prediction can be made about whether the home team or the away team wins. Penalties here are roughly 50-50, as in real life, so there is clearly no advantage to either team when it comes to penalties.

Q.4: What is the trend of normal goals and penalty goals in the World Cup?

Trend of Total goals scored in World Cup from 1986 to 2022

From the above two graphs we can see that in 1990, when Italy was host, normal goals were lower and penalty goals were higher than in other World Cups. Therefore we can say that the chance of a match going to penalties was higher in 1990 than in other World Cups.

Q.5: Has there been any significant increase in total goals scored over the years?

The trend suggests that the total number of goals scored is increasing over the years, but from 2014, when Brazil was host, to 2022, when Qatar was host, there is only a slight increase in the total number of goals scored.

Note: Total goals scored here is normal goals + penalty goals

Now, let's look at the round-wise distribution of goals

Q.6: Does the stage/round of the World Cup influence goals scored per match?

From the graph we can see that goals per match is highest in the third-place match, which is quite interesting, because we would expect the most goals per match either in the final, as there is more competition there, or in the group stage, as teams are less concerned about the scoreline at that stage.

The above graph gives a very different perspective: you would expect the highest penalty goals per match in the final or semi-finals due to the high level of competition at that stage, but penalty goals per match is highest in the quarter-finals.

According to the graph, total goals per match is highest in the final, which makes sense given the high level of competition at that stage, but the second highest turns out to be the quarter-finals, which contradicts our expectation, because after the final the semi-finals are the most competitive stage.

Section 4: Analysis of the Final Stage of the World Cup

Q.1: Is there any significant effect of digitization on attendance at World Cup finals?

Dataframe of Final matches in World Cup from 1986 to 2022
Trend of Attendance in World Cup Finals

According to the graph, the highest attendance in a final was in 1986, when Mexico was host, even though we previously saw that average attendance was highest in 1994, when the USA was host. The high average attendance in 1994 is consistent with the sudden spike in final attendance in 1994, as can be seen in the graph.

Q.2: Is there any significant increase in the number of goals scored in World Cup finals over the years?

Trend of Total goals scored, i.e. normal goals and penalties, in World Cup Finals

The sudden spikes in goals in 1994, 2006, and 2022 are explained by those finals going to penalties, as can be seen in the dataframe given below.

The trend of goals in the final is rather erratic, but we can say it is increasing if we accept that the sudden spikes are due to penalties.

Q.3: Which team has the most appearances in World Cup finals from 1986 to 2022?

Teams with the most appearances in Finals

The teams with the most appearances in finals are Argentina, France, and Brazil. Each has reached the World Cup final 4 times, as can be seen in the above graph.

Q.4: Which team has won the most World Cups from 1986 to 2022?

Teams having won the World Cup/Teams with most World Cup wins

The teams with the most World Cup titles from 1986 to 2022 are Argentina, France, and Brazil. Each has won the World Cup 2 times, as can be seen in the above graph.

Section 5: Analysis of triumphs during the World Cup from 1986 to 2022

Team with most wins when playing as Home team

Most Successful team when team is regarded as Home team

The most successful home team is Brazil, with 29 wins when regarded as the home team.

Team with most wins when playing as Away team

Most Successful team when team is regarded as Away team

The most successful away team is Spain, with 14 wins when regarded as the away team.

Q.1: Which Team has dominated the World Cup by winning the most matches between 1986 and 2022?

Most Successful team/Team with most wins in World Cup from 1986 to 2022

The most successful team during the World Cups from 1986 to 2022 is Brazil, as they won 42 games over the whole period, won the World Cup 2 times, and appeared in the final 4 times.

Q.2: During the World Cup campaigns from 1986 to 2022, which manager had the most wins?

Most Successful Manager/Manager with most wins in World Cup with teams managed

Luiz Felipe Scolari is the most successful manager in the World Cups from 1986 to 2022. He also won the World Cup once with Brazil, with Ronaldo (aka R9), who played under him, as the tournament's top scorer.

Q.3: During the World Cup campaigns from 1986 to 2022, which player had the most success as captain?

Most Successful player as captain in World Cup with their Nations

The most successful captain during the World Cup campaigns of 1986 to 2022 is Hugo Lloris, who plays for France. He also led France to the World Cup title as captain in 2018, when Russia was host.

Q.4: Which players have won the Golden Boot in the World Cup over the years?

World Cup Finals

The top scorer differs from tournament to tournament because the World Cup takes place every 4 years. Ronaldo (aka R9) and Kylian Mbappe each scored 8 goals for their nation, the highest in a single tournament between 1986 and 2022, as can be seen from the graph.

The Golden Boot is an award given to the player who scores the most goals in a World Cup. So every player/top scorer in the graph won the Golden Boot in that particular year.

Inferences and Conclusion

Many conclusions can be drawn from this dataset. The conclusions below are listed in sections.

Based on the 10 finals from 1986 to 2022, the probability of a World Cup final going to overtime or penalties is 50.00%.

The probability of a World Cup final going to penalties is 30.00%.

The probability of a World Cup final ending in overtime/extra time is 20.00%.
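These percentages are simple empirical frequencies over the 10 finals. The sketch below reproduces the arithmetic; the dictionary of how each final was decided is written out by hand here for illustration rather than derived from the dataset, and the labels are my own.

```python
from collections import Counter

# How each final from 1986 to 2022 was decided (hand-entered for illustration).
final_decided_in = {
    1986: "regular time",
    1990: "regular time",
    1994: "penalties",
    1998: "regular time",
    2002: "regular time",
    2006: "penalties",
    2010: "extra time",
    2014: "extra time",
    2018: "regular time",
    2022: "penalties",
}

counts = Counter(final_decided_in.values())
n_finals = len(final_decided_in)

# Empirical frequencies over the 10 finals.
p_penalties = counts["penalties"] / n_finals
p_extra_time = counts["extra time"] / n_finals
p_beyond_regular = (counts["penalties"] + counts["extra time"]) / n_finals

print(f"Penalties: {p_penalties:.2%}")
print(f"Extra time only: {p_extra_time:.2%}")
print(f"Extra time or penalties: {p_beyond_regular:.2%}")
```

With only 10 finals in the sample, these are rough frequencies rather than stable probabilities.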

South Korea made history by becoming the first member of the Asian Football Confederation (AFC) to reach a FIFA World Cup semifinal.

At the 2022 FIFA World Cup, Morocco exceeded all predictions by winning their group, which also included the 2018 runners-up Croatia, and by defeating elite opponents like Belgium, Spain, and Portugal.

Morocco became the first country from Africa to ever reach the semifinals.

On June 22, 2002, South Korea defeated European giants Spain 5-3 in a penalty shootout in the quarterfinal after a 0-0 draw, making history in the process.

The 2002 World Cup was the first and only time the competition was jointly staged by two nations, South Korea and Japan.

The 2022 World Cup in Qatar was the first hosted by a Middle Eastern nation and only the second held in Asia (the previous one was in 2002).

The number of unique teams/countries that participated in the World Cup from 1986 to 2022 is 77.

The team with the most appearances in the World Cup is Germany with 58 matches played, followed by Brazil with 57 and then Argentina with 54.

The World Cup with the highest attendance was in 1994, when the USA was host, with total attendance of 3,587,538, or about 3.59 million.

The World Cup with the highest attendance per match was also 1994, when the USA was host, at approximately 68,992, or about 69K.

The World Cup final with the highest attendance was in 1986, when Mexico was host, with 114,600 in attendance, or about 114.6K.

Most nerve-racking Final

The most goals scored in a World Cup final is 12, in 2022 when Qatar was host, counting shootout penalties. This final was arguably the most nerve-racking in the history of the World Cup: the level of competition was high, 12 goals were scored in a single match, and the match went to overtime only to be decided by penalties.

One of the unluckiest teams at the World Cup is Mexico. They have lost more games than any other team in tournament history.

The teams with the most final appearances and the most World Cup titles were Argentina, Brazil, and France, each appearing in the final 4 times and winning 2 times.

In conclusion, the most successful team during the World Cups from 1986 to 2022 was Brazil, as they won 42 games over the whole period, won the World Cup 2 times, and appeared in the final 4 times.

The biggest margin of victory in the World Cups from 1986 to 2022 came at the 2002 World Cup, jointly hosted by Japan and South Korea, when Germany defeated Saudi Arabia 8-0.

Most Successful Manager

Luiz Felipe Scolari is the most successful manager during the World Cups from 1986 to 2022.

He won 16 games as manager over the whole period from 1986 to 2022.

He also won the World Cup once with Brazil, with Ronaldo (Brazilian), who played under him, as the top scorer.

Most Successful player as Captain

The most successful captain during the World Cup campaigns of 1986 to 2022 is Hugo Lloris, who plays for France.

As captain he won 14 games, which is the highest of any captain.

He also led France to the World Cup title as captain in 2018, when Russia was host.

References and Future Work

Reference

Link to github repository - https://github.com/meetth77/jovian_practice

Link to kaggle website(from where data is taken)- https://www.kaggle.com/datasets/piterfm/fifa-football-world-cup

Link to Images:

Image 1:https://images8.alphacoders.com/128/1288503.jpg

Image 2:https://mir-s3-cdn-cf.behance.net/project_modules/max_1200/763bd417699179.562bdb4d800a4.jpg

Image 3:https://upload.wikimedia.org/wikipedia/commons/9/95/Poland_Senegal_2018-06-19_02.jpg

Image 4:https://w0.peakpx.com/wallpaper/693/934/HD-wallpaper-fifa-world-cup-russia-2018-fifa-world-cup-russia-2018-games-games-football-fifa.jpg

Image 5:https://www.aljazeera.com/wp-content/uploads/2022/10/GettyImages-1227824947.jpg?resize=770%2C513&quality=80

Image 6:https://i.pinimg.com/564x/4d/8d/a1/4d8da14badd023b10cbfd4d0df33825c.jpg

Future Work

The data used here covers 1986 to 2022, i.e. the last 10 World Cups. In future I would consider the whole dataset, from 1930 to 2022, and analyze it to get further insights.

In future, the attendance data could be used to estimate revenue from ticket sales.

The data on attendance, home goals, away goals, home penalties, away penalties, and top scorers could also be used to predict goals and attendance at upcoming World Cups and how successful those tournaments will be.

Based on the data provided, we can derive different probabilities and statistics related to the World Cup, which will be helpful for future predictions.

This data analytics project also has future scope in sports analytics.

Saving the data